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Abstract 

Scoring  rules  assess  the  quality  of  probabilistic  forecasts,  by  assigning  a  numerical 
score  based  on  the  forecast  and  on  the  event  or  value  that  materializes.  A  scoring  rule 
is  proper  if  the  forecaster  maximizes  the  expected  score  for  an  observation  drawn  from 
the  distribution  F  if  she  issues  the  probabilistic  forecast  F,  rather  than  G  ^  F.  It  is 
strictly  proper  if  the  maximum  is  unique.  In  prediction  problems,  proper  scoring  rules 
encourage  the  forecaster  to  make  careful  assessments  and  to  be  honest.  In  estimation 
problems,  strictly  proper  scoring  rules  provide  attractive  loss  and  utility  functions  that 
can  be  tailored  to  the  scientific  problem  at  hand. 

This  paper  reviews  and  develops  the  theory  of  proper  scoring  rules  on  general  prob¬ 
ability  spaces,  and  proposes  and  discusses  examples  thereof.  Proper  scoring  rules  derive 
from  convex  functions  and  relate  to  information  measures,  entropy  functions  and  Breg- 
man  divergences.  In  the  case  of  categorical  variables,  we  prove  a  rigorous  version  of  the 
Savage  representation.  Examples  of  scoring  rules  for  probabilistic  forecasts  in  the  form 
of  predictive  densities  include  the  logarithmic,  spherical,  pseudospherical  and  quadratic 
scores.  The  continuous  ranked  probability  score  applies  to  probabilistic  forecasts  that 
take  the  form  of  predictive  cumulative  distribution  functions.  It  generalizes  the  absolute 
error  and  forms  a  special  case  of  a  new  and  very  general  type  of  score,  the  energy  score. 
Like  many  other  scoring  rules,  the  energy  score  admits  a  representation  in  terms  of  neg¬ 
ative  definite  functions,  with  links  to  inequalities  of  Hoeffding  type,  in  both  univariate 
and  multivariate  settings.  Proper  scoring  rules  for  quantile  and  interval  forecasts  are 
also  discussed.  We  relate  proper  scoring  rules  to  Bayes  factors  and  to  cross-validation, 
and  propose  a  novel  form  of  cross-validation,  random-fold  cross-validated  likelihood. 

A  case  study  on  probabilistic  weather  forecasts  in  the  North  American  Pacific  North¬ 
west  illustrates  the  importance  of  propriety.  We  note  optimum  score  approaches  to  point 
and  quantile  estimation,  and  propose  the  intuitively  appealing  interval  score  as  a  utility 
function  in  interval  estimation  that  addresses  width  as  well  as  coverage. 
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1  Introduction 


One  of  the  major  purposes  of  statistical  analysis  is  to  make  forecasts  for  the  future,  and  to 
provide  suitable  measures  of  the  uncertainty  associated  with  them.  Consequently,  forecasts 
should  be  probabilistic  in  nature,  taking  the  form  of  probability  distributions  over  future 
quantities  or  events  (Dawid  1984).  Indeed,  over  the  past  two  decades  probabilistic  forecast¬ 
ing  has  become  routine  in  applications  such  as  weather  and  climate  prediction  (Palmer  2002; 
Gneiting  and  Raftery  2005),  stochastic  finance  (DufRe  and  Pan  1997)  and  macroeconomic 
forecasting  (Garratt,  Lee,  Pesaran  and  Shin  2003).  In  the  statistical  literature,  advances 
in  Markov  chain  Monte  Carlo  methodology  (see,  for  example,  Besag,  Green,  Higdon  and 
Mengersen  1995)  have  led  to  explosive  growth  in  the  use  of  predictive  distributions,  mostly 
in  the  form  of  Monte  Carlo  samples  from  posterior  predictive  distributions  of  quantities 
of  interest.  Gneiting,  Raftery,  Balabdaoui  and  Westveld  (2003)  and  Gneiting,  Balabdaoui 
and  Raftery  (2005)  contend  that  the  goal  of  probabilistic  forecasting  is  to  maximize  the 
sharpness  of  the  predictive  distributions  subject  to  calibration.  Calibration  refers  to  the  sta¬ 
tistical  consistency  between  the  distributional  forecasts  and  the  observations,  and  is  a  joint 
property  of  the  forecasts  and  the  events  or  values  that  materialize.  Sharpness  refers  to  the 
concentration  of  the  predictive  distributions  and  is  a  property  of  the  forecasts  only. 

Scoring  rules  provide  summary  measures  for  the  evaluation  of  probabilistic  forecasts, 
by  assigning  a  numerical  score  based  on  the  forecast  and  on  the  event  or  value  that  ma¬ 
terializes.  In  terms  of  elicitation,  the  role  of  scoring  rules  is  to  encourage  the  assessor  to 
make  careful  assessments  and  to  be  honest  (Garthwaite,  Kadane  and  O’Hagan  2005).  In 
terms  of  evaluation,  scoring  rules  measure  the  quality  of  the  probabilistic  forecasts,  reward 
probability  assessors  for  forecasting  jobs,  and  rank  competing  forecast  procedures.  Me¬ 
teorologists  refer  to  this  broad  task  as  forecast  verification,  and  much  of  the  underlying 
methodology  has  been  developed  by  atmospheric  scientists  (Jolliffe  and  Stephenson  2003). 
In  a  Bayesian  context,  scores  are  frequently  referred  to  as  utilities,  thereby  emphasizing  the 
Bayesian  principle  of  maximizing  the  expected  utility  of  a  predictive  distribution  (Bernardo 
and  Smith  1994).  We  take  scoring  rules  to  be  positively  oriented  rewards  that  a  forecaster 
wishes  to  maximize.  Specifically,  if  the  forecaster  quotes  the  predictive  distribution  P  and 
the  event  x  materializes,  her  reward  is  S(P,x).  The  function  S(P,-)  takes  values  in  the 
extended  real  line  M  =  [—00,00],  and  we  write  S(P,Q)  for  the  expected  value  of  S(P,-) 
under  Q.  Suppose,  then,  that  the  forecaster’s  best  judgement  is  the  distributional  forecast 
Q.  The  forecaster  has  no  incentive  to  predict  any  P  ^  Q,  and  is  encouraged  to  quote  her 
true  belief,  P  =  Q,  if  S(Q,  Q )  >  S(P,  Q )  with  equality  if  and  only  if  P  =  Q.  A  scoring  rule 
with  this  property  is  said  to  be  strictly  proper.  If  S(Q,Q)  >  S(P,Q )  for  all  P  and  Q  the 
scoring  rule  is  said  to  be  proper.  Propriety  is  essential  in  scientific  and  operational  forecast 
evaluation,  and  our  case  study  below  provides  a  striking  example  of  some  of  the  difficulties 
resulting  from  the  use  of  intuitively  appealing  but  improper  scoring  rules. 

In  estimation  problems,  strictly  proper  scoring  rules  provide  attractive  loss  and  utility 
functions  that  can  be  tailored  to  a  scientific  problem.  To  fix  the  idea,  suppose  that  we 
wish  to  fit  a  parametric  model  Pq  based  on  a  sample  X\, . . . ,  Xn.  To  estimate  9,  we  might 
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measure  the  goodness-of-fit  by  the  mean  score 


1  n 

Sn(6)  =  -^2S(Pe,Xi), 

where  S'  is  a  strictly  proper  scoring  rule.  If  do  denotes  the  true  parameter,  asymptotic 
arguments  indicate  that  argmaxg  Sn(d)  — >  6q  as  n  — >  oo.  This  suggests  a  general  approach 
to  estimation:  choose  a  strictly  proper  scoring  rule  that  is  tailored  to  the  scientific  problem 
at  hand,  maximize  Sn{0 )  over  the  parameter  space,  and  take  0n  =  arg maxo  Sn(9)  as  the 
optimum  score  estimator  based  on  the  scoring  rule  S.  Pfanzagl  (1969)  and  Birge  and 
Massart  (1993)  studied  this  approach  under  the  heading  of  minimum  contrast  estimation. 
Maximum  likelihood  estimation  forms  a  special  case  of  optimum  score  estimation,  and 
optimum  score  estimation  forms  a  special  case  of  M-estimation  (Huber  1964),  in  that  the 
function  to  be  optimized  derives  from  a  strictly  proper  scoring  rule.  The  appeal  of  optimum 
score  estimation  lies  in  the  potential  adaptation  of  the  scoring  rule  to  the  problem  at  hand. 
Apparently,  this  approach  has  only  very  recently  been  explored  (Buja,  Stuetzle  and  Shen 
2005;  Gneiting,  Raftery,  Westveld  and  Goldman  2005). 

This  paper  reviews  and  develops  the  theory  of  proper  scoring  rules  on  general  probability 
spaces,  proposes  and  discusses  examples  thereof,  and  supplies  case  studies.  The  remainder  of 
the  paper  is  organized  as  follows.  Section  2  states  a  fundamental  characterization  theorem, 
reviews  the  links  between  proper  scoring  rules,  information  measures,  entropy  functions 
and  Bregman  divergences,  and  introduces  skill  scores.  Section  3  turns  to  scoring  rules  for 
categorical  variables.  We  provide  a  rigorous  version  of  the  Savage  (1971)  representation 
and  relate  to  a  more  recent  characterization  of  Schervish  (1989).  Bremnes  (2004,  p.  346) 
noted  that  the  literature  on  scoring  rules  for  probabilistic  forecasts  of  continuous  variables 
is  sparse.  We  address  this  issue  in  Section  4  where  we  discuss  the  spherical,  pseudospher- 
ical,  logarithmic  and  quadratic  scores.  The  continuous  ranked  probability  score  has  lately 
attracted  the  attention  of  meteorologists,  enjoys  appealing  properties,  and  might  serve  as 
a  standard  score  in  evaluating  probabilistic  forecasts  of  real- valued  variables.  It  forms  a 
special  case  of  a  novel  and  very  general  type  of  scoring  rule,  the  energy  score.  Section  5 
introduces  an  even  more  general  construction  that  is  based  on  negative  definite  functions 
and  inequalities  of  Hoeffding  type,  with  side  results  on  expectation  inequalities  and  posi¬ 
tive  definite  kernels  that  are  of  interest  in  their  own  right.  Section  6  studies  scoring  rules 
for  quantile  and  interval  forecasts.  We  show  the  class  of  proper  scoring  rules  for  quantile 
forecasts  to  be  larger  than  conjectured  by  Cervera  and  Munoz  (1996)  and  introduce  the 
interval  score ,  a  scoring  rule  for  prediction  intervals  that  is  proper  and  has  intuitive  appeal. 
In  Section  7  we  relate  proper  scoring  rules  to  Bayes  factors  and  to  cross-validation,  and 
propose  a  novel  form  of  cross-validation,  random-fold  cross-validated  likelihood.  Section  8 
presents  the  case  study  on  the  use  of  scoring  rules  in  the  evaluation  of  probabilistic  weather 
forecasts.  Section  9  turns  to  optimum  score  estimation  and  closes  the  paper.  We  discuss 
point,  quantile  and  interval  estimation,  and  propose  the  use  of  the  interval  score  as  a  utility 
function  that  addresses  width  as  well  as  coverage.  Scoring  rules  show  a  superficial  analogy 
to  statistical  depth  functions,  as  we  briefly  discuss  in  the  Appendix. 
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2  Characterizations  of  proper  scoring  rules 


We  introduce  notation,  provide  characterizations  of  (strictly)  proper  scoring  rules,  and  relate 
them  to  information  measures  and  Bregnran  divergences.  The  discussion  is  more  technical 
than  in  the  remainder  of  the  paper,  and  readers  with  more  applied  interests  might  skip 
ahead  to  Section  2.3,  in  which  we  discuss  skill  scores,  without  significant  loss  of  continuity. 

2.1  Proper  scoring  rules  and  convex  functions 

We  consider  probabilistic  forecasts  on  a  general  sample  space  P.  Let  A  be  a  a- algebra  of 
subsets  of  P,  and  let  V  be  a  convex  class  of  probability  measures  on  (P,*4).  A  function 
defined  on  P  and  taking  values  in  the  extended  real  line,  M  =  [— oo,  oo],  is  V- quasiintegrable 
if  it  is  measurable  with  respect  to  A  and  is  quasiintegrable  with  respect  to  all  P  £  V  (Bauer 
2001,  p.  64).  A  probabilistic  forecast  is  any  probability  measure  P  G  V.  A  scoring  rule  is  any 
extended  real- valued  function  S  :  V  x  P  — >•  M  such  that  S(P,  •)  is  P-quasiintegrable  for  all 
Hence,  if  the  forecast  is  P  and  to  materializes,  the  forecaster’s  reward  is  S(P,co).  We 
permit  algebraic  operations  on  the  extended  real  line  and  deal  with  the  respective  integrals 
and  expectations  as  described  in  Section  2.1  of  Mattner  (1997)  or  Section  3.1  of  Griinwald 
and  Dawid  (2004).  We  write 

S(P,  Q)  =  I  S{P,co)  d Q(u) 

for  the  expected  score  under  Q  when  the  probabilistic  forecast  is  P.  The  scoring  rule  S  is 
proper  relative  to  V  if 


S(Q,  Q)  >  S(P,  Q)  for  all  P,QeV.  (1) 

It  is  strictly  proper  relative  to  V  if  (1)  holds  with  equality  if  and  only  if  P  =  Q,  thereby 
encouraging  honest  quotes  by  the  forecaster.  Clearly,  finite  sums  of  (strictly)  proper  scoring 
rules  and  P-integrable  functions  are  (strictly)  proper.  The  term  was  apparently  coined  by 
Winkler  and  Murphy  (1968,  p.  754),  but  the  general  idea  dates  back  at  least  to  Good 
(1952,  p.  112).  In  a  parametric  context,  and  with  respect  to  estimators,  Lehmann  and 
Casella  (1998,  p.  157)  refer  to  the  defining  property  in  (1)  as  risk  unbiasedness. 

A  function  G  :  V  — ►  M  is  convex  if 

G((l  -  X)Pq  +  APi)  <  (1  —  A)  G(Pq)  +  A  G(Pi)  for  all  A  e  (0, 1),  P0,  P1  €  V.  (2) 

It  is  strictly  convex  if  (2)  holds  with  equality  if  and  only  if  Pq  =  P\.  A  function  G*(P,  ■ )  : 
P  -»  1  is  a  subtangent  of  G  at  the  point  P  G  V  if  it  is  integrable  with  respect  to  P, 
quasiintegrable  with  respect  to  all  Q  6  V,  and 

G(Q)  >  G(P)  +  j  G*(P,  lo)  d (Q  -  P)(u)  (3) 

for  all  Q  G  V.  The  following  characterization  theorem  is  more  general  and  considerably 
simpler  than  previous  results  by  McCarthy  (1956)  and  Hendrickson  and  Buehler  (1971). 
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Definition  2.1  A  scoring  rule  S  :  P  x  12  — ►  M  is  regular  relative  to  the  class  P  if  S(P,  Q ) 
is  real-valued  for  all  P,Q  £  V,  except  possibly  that  S(P,  Q )  =  — oo  if  P  /  Q. 

Theorem  2.2  A  regular  scoring  rule  S  :  V  x  17  — >  M  is  (strictly)  proper  relative  to  the  class 
V  if  and  only  if  there  exists  a  (strictly)  convex,  real-valued  function  G  on  V  such  that 

S{P ,  tv)  =  G(P)  -  J  G*  (P,  tv)  d P(v)  +  G*  (P,  tv)  (4) 

for  P  G  V  and  tv  G  D,  where  G*(P ,  • )  :  fl  — >  M  is  a  subtangent  of  G  at  the  point  P  G  V. 

Proof.  If  the  scoring  rule  S  is  of  the  stated  form,  the  subtangent  inequality  (3)  implies 
the  defining  inequality  (1),  that  is,  propriety.  Conversely,  suppose  that  S  is  a  regular 
proper  scoring  rule.  Define  G  :  V  — ■>  M  by  G(P)  =  S(P,P)  =  supgep  S(Q,  P),  which 
is  the  pointwise  supremum  over  a  class  of  convex  functions  and  therefore  is  convex  on  V. 
Furthermore,  the  subtangent  inequality  (3)  holds  with  G*(P,v)  =  S(P,  v).  This  implies 
the  representation  (4)  and  proves  the  claim  for  propriety.  By  analogy  to  an  argument  of 
Hendrickson  and  Buehler  (1971),  strict  inequality  in  (1)  is  equivalent  to  no  subtangent  of 
G  at  P  being  a  subtangent  of  G  at  Q,  for  P,  Q  G  V  and  P  /  Q,  and  this  is  equivalent  to  G 
being  strictly  convex  on  V.  ■ 

Expressed  slightly  differently,  a  regular  scoring  rule  S  is  (strictly)  proper  relative  to  the 
class  V  if  and  only  if  the  expected  score  function  G(P)  =  S(P,P)  is  (strictly)  convex  and 
S(P,  v)  is  a  subtangent  of  G  at  the  point  P,  for  all  P  e  V. 

2.2  Information  measures  and  Bregman  divergences 

Suppose  that  the  scoring  rule  S  is  proper  relative  to  the  class  V.  Following  Griinwald  and 
Dawid  (2004)  and  Buja  et  al.  (2005),  we  call  the  expected  score  function 

G(P)  =  supggp  S(Q,  P),  PGP,  (5) 

the  uncertainty  measure  or  generalized  entropy  function  associated  with  the  scoring  rule  S. 
This  is  the  maximally  achievable  utility,  and  the  term  entropy  function  is  used  as  well.  If 
S  is  regular  and  proper,  we  call 

d(P,  Q)  =  S(Q ,  Q)  -  S(P,  Q ) ,  P,  Q  G  P,  (6) 

the  associated  divergence  function.  The  divergence  function  is  nonnegative,  and  if  S  is 
strictly  proper,  then  d(P,  Q )  is  strictly  positive  unless  P  =  Q.  If  the  sample  space  is  fi¬ 
nite  and  the  entropy  function  is  sufficiently  smooth,  the  divergence  function  becomes  the 
Bregman  divergence  (Bregman  1967).  Bregman  divergences  play  major  roles  in  optimiza¬ 
tion,  and  recently  have  attracted  the  attention  of  the  machine  learning  community  (Collins, 
Schapire  and  Singer  2002).  The  term  Bregman  distance  is  also  used,  even  though  d(P,Q) 
is  not  necessarily  the  same  as  d(Q,P).  An  interesting  problem  is  to  find  conditions  under 
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which  a  divergence  function  d  is  a  score  divergence ,  in  the  sense  that  it  admits  the  repre¬ 
sentation  (6)  for  a  proper  scoring  rule  S,  and  to  describe  principled  ways  of  finding  such 
a  scoring  rule.  The  landmark  paper  by  Savage  (1971)  provides  a  necessary  condition  on 
a  symmetric  divergence  function  d  to  be  a  score  divergence:  If  P  and  Q  are  concentrated 
on  the  same  two  mutually  exclusive  events,  and  identified  with  the  respective  probabilities, 
p,q  €  [0, 1],  then  d(P ,  Q)  reduces  to  a  linear  function  of  (p  —  q)2. 

Friedman  (1983)  and  Nau  (1985)  studied  a  looser  type  of  relationship  between  proper 
scoring  rules  and  distance  measures  on  classes  of  probability  distributions.  They  restrict 
attention  to  metrics,  that  is,  distance  measures  which  are  symmetric  and  satisfy  the  triangle 
inequality,  and  call  a  scoring  rule  S  effective  with  respect  to  a  metric  d  if 

S(Pi,Q)  >  S(P2,Q)  d(P\ ,  Q)  <  d(P2,Q). 

Nau  (1985)  calls  a  metric  co-effective  if  there  is  a  proper  scoring  rule  that  is  effective  with 
respect  to  it.  His  Proposition  1  implies  that  the  h,  loo  and  Hellinger  distances  on  spaces  of 
absolutely  continuous  probability  measures  are  not  co-effective. 

Sections  3  through  5  provide  numerous  examples  of  proper  scoring  rules  on  general 
sample  spaces  with  the  associated  entropy  functions  and  divergence  functions.  For  instance, 
the  classical  logarithmic  score  is  linked  to  Shannon  entropy  and  Kullback-Leibler  divergence. 
Griinwald  and  Dawid  (2004)  and  Buja  et  al.  (2005)  give  further  examples  of  proper  scoring 
rules,  entropy  and  divergence  functions  on  finite  sample  spaces,  and  discuss  the  connections 
to  the  Bregman  distance  in  detail. 

2.3  Skill  scores 

In  practice,  scores  are  aggregated  and  competing  forecast  procedures  are  ranked  by  their 
average  score, 

n 

Sn  =  Yl  S(Pi’Xi)i 
2=1 

over  a  fixed  set  of  forecast  situations.  We  give  examples  of  this  in  case  studies  in  Sections 
6  and  8  below.  Recommendations  for  choosing  a  scoring  rule  can  be  found  in  Section  6  of 
Winkler  (1996)  and  throughout  this  paper. 

Scores  for  competing  forecast  procedures  are  directly  comparable  if  they  refer  to  exactly 
the  same  set  of  situations.  If  scores  for  distinct  sets  of  situations  are  compared,  considerable 
care  needs  to  be  exercised  to  separate  the  confounding  effects  of  intrinsic  predictability  and 
predictive  performance.  For  example,  there  is  substantial  spatial  and  temporal  variability 
in  the  predictability  of  weather  and  climate  elements  (Langland  et  al.  1999;  Campbell  and 
Diebold  2005).  Hence,  a  score  that  is  superior  for  a  given  location  or  season  might  be 
inferior  for  another,  or  vice  versa.  To  address  this  issue,  atmospheric  scientists  have  put 
forth  skill  scores  of  the  form 

ofcst  _  oref 

no  ° n _ °rt 

n  ^opt  _  ^ref  ’ 

°n 
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where  S'„cst  is  the  forecaster’s  average  score,  S,°pt  is  the  mean  score  for  a  hypothetical  ideal 
or  optimal  forecast,  and  S^{  is  the  average  score  for  a  reference  strategy  (Murphy  1973; 
Wilks  1995,  p.  237;  Potts  2003,  p.  27;  Briggs  and  Ruppert  2005).  Skill  scores  are  standard¬ 
ized  in  that  (7)  takes  the  value  1  for  an  optimal  forecast,  which  is  typically  understood  as  a 
point  measure  in  the  event  or  value  that  materializes,  and  the  value  0  for  the  reference  fore¬ 
cast.  Negative  values  of  the  skill  score  indicate  forecasts  that  are  of  lesser  quality  than  the 
reference.  The  reference  forecast  is  typically  a  climatological  forecast,  that  is,  an  estimate 
of  the  marginal  distribution  of  the  predictand.  For  example,  a  climatological  probabilistic 
forecast  for  maximum  temperature  on  Independence  Day  in  Seattle,  Washington  might  be  a 
smoothed  version  of  the  local  historic  record  of  July  4  maximum  temperature.  Climatolog¬ 
ical  forecasts  are  independent  of  the  forecast  horizon;  they  are  calibrated  by  construction, 
but  often  lack  sharpness. 

Unfortunately,  skill  scores  of  the  form  (7)  are  generally  improper,  even  if  the  underlying 
scoring  rule  S  is  proper.  Murphy  (1973)  studied  hedging  strategies  in  the  case  of  the  Brier 
skill  score  for  probability  forecasts  of  a  dichotomous  event.  He  showed  that  the  Brier  skill 
score  is  asymptotically  proper,  in  the  sense  that  the  benefits  of  hedging  become  negligible 
as  the  number  of  independent  forecasts  grows.  Similar  arguments  may  apply  to  skill  scores 
based  on  other  proper  scoring  rules.  Mason’s  (2004)  recent  claim  of  the  propriety  of  the 
Brier  skill  score  rests  upon  unjustified  approximations  and  generally  is  incorrect. 

3  Scoring  rules  for  categorical  variables 

We  now  review  the  representations  of  Savage  (1971)  and  Schervish  (1989)  that  characterize 
scoring  rules  for  probabilistic  forecasts  of  categorical  and  binary  variables,  and  we  give 
examples  of  proper  scoring  rules. 

3.1  Savage  representation 

We  consider  probabilistic  forecasts  of  a  categorical  variable.  Hence,  the  sample  space  H  = 
{1, . . .  ,m}  consists  of  a  finite  number  tti  of  mutually  exclusive  events,  and  a  probabilistic 
forecast  is  a  probability  vector  (p\ , . . .  ,pm).  Using  the  notation  of  Section  2,  we  consider 
the  convex  class  V  =  Vm ,  where 

Vm  =  [p  =  (pi,  ■  ■  •  j Pm)  pi  H - h  Pm  =  l}  • 

A  scoring  rule  S  can  then  be  identified  with  a  collection  of  m  functions 

S(  ■  ,l)  ■■  Vm  —>  M,  7  =  1,...,  777. 

In  other  words,  if  the  forecaster  quotes  the  probability  vector  p  and  the  event  i  materializes, 
her  reward  is  S(p,i).  Theorem  3.2  below  is  a  special  case  of  Theorem  2.2  and  provides  a 
rigorous  version  of  the  Savage  (1971)  representation  of  proper  scoring  rules  on  finite  sample 
spaces.  Our  contributions  lie  in  the  notion  of  regularity,  in  the  rigorous  treatment,  and  in 
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the  introduction  of  appropriate  tools  of  convex  analysis  (Rockafellar  1970,  Sections  23-25). 
Specifically,  let  G  :  Vm  -»Mbea  convex  function.  A  vector  G'(p)  =  [G\  (p), . . . ,  G'm(jp ))  is 
a  subgradient  of  G  at  the  point  p  G  Vm  if 

G(q)>G(p)  +  (G'(p),q-p)  (8) 

for  all  q  G  Vm,  where  (•,•)  denotes  the  standard  scalar  product.  We  assume  that  the 
components  of  G'(p)  are  real- valued,  except  that  we  permit  G[(p)  =  —  oo  if  pi  =  0. 

Definition  3.1  A  scoring  rule  S  for  categorical  forecasts  is  regular  if  S(- ,  i)  is  real-valued 
for  i  =  1, . . . ,  m,  except  possibly  that  S(p,  i )  =  — oo  if  pt  =  0. 

Theorem  3.2  (McCarthy,  Savage)  A  regular  scoring  rule  S  for  categorical  forecasts  is 
(strictly)  proper  if  and  only  if 

S(p,i)  =  G(p)-(G\p),p)  +  G'i(p)  for  i  =  l,...,m,  (9) 

where  G  :  Vm  — >  M  is  a  (strictly)  convex  function  and  G'(p)  is  a  subgradient  of  G  at  the 
point  p,  for  all  p  G  Vm. 


Put  slightly  differently,  a  regular  scoring  rule  S  is  (strictly)  proper  if  and  only  if  the 
expected  score  function  G(p)  =  S(p,p)  is  (strictly)  convex  on  Vrn  and  the  vector  with 
components  S(p,  i)  for  i  =  1, ...  ,m  is  a  subgradient  of  G  at  the  point  p,  for  all  p  G  Vrn . 
In  view  of  these  results,  every  bounded  (strictly)  convex  function  G  on  Vm  generates  a 
regular  (strictly)  proper  scoring  rule.  This  function  G  becomes  the  expected  score  function, 
information  measure  or  entropy  function  (5)  associated  with  the  score,  and  the  divergence 
function  (6)  is  the  respective  Bregman  distance. 

We  now  give  a  number  of  examples.  The  scoring  rules  in  Examples  3.3  through  3.5  are 
strictly  proper,  and  the  score  in  Example  3.6  is  proper  but  not  strictly  proper. 

Example  3.3  (quadratic  or  Brier  score)  If  G(p)  =  Y)]'=\  pj  —  1  then  (9)  yields  the 
quadratic  score  or  Brier  score, 

m  m 

S(p,  i)  =  ~  (<%  -  Pj )2  =  2Pi  -  J^Pj  - 

3= 1  i=l 

where  5ij  =  1  if  i  =  j  and  Sij  =  0  otherwise.  The  associated  Bregman  divergence  is  squared 
Euclidean  distance,  d(p,q)  =  JOJL \  [P]  ~  Qj)2-  This  well-known  scoring  rule  was  proposed 
by  Brier  (1950).  Selten  (1998)  gave  an  axiomatic  characterization. 


Example  3.4  (spherical  score)  Let  a  >  1  and  consider  the  generalized  entropy  function 
G(p)  =  {Jf  jLiPj*)1/01-  This  corresponds  to  the  pseudospherical  score, 


S(p,i) 


pr1 


(EtiP“)(“-i)/a: 


which  reduces  to  the  traditional  spherical  score  when  a  =  2.  The  associated  Bregman 
divergence  is  d(p,q)  =  (E”li?f)1/a  -  E™=i  PjQ* 'V(E”Li  <?")(a-1)/". 


Example  3.5  (logarithmic  score)  Negative  Shannon  entropy,  G(p)  =  1  Pj  log Pj , 

corresponds  to  the  logarithmic  score,  S(p,i)  =  log  pi-  The  associated  Bregman  distance 
is  the  Kullback-Leibler  divergence,  d(p,q)  =  Qj  log(<?j/pj).  This  scoring  rule  is  classi¬ 

cal  and  dates  back  at  least  to  Good  (1952).  Detailed  information-theoretic  perspectives  and 
interpretations  in  terms  of  gambling  returns  can  be  found  in  Roulston  and  Smith  (2002) 
and  Daley  and  Vere- Jones  (2004). 


Example  3.6  (zero-one  score)  The  zero-one  scoring  rule  rewards  a  probabilistic  forecast 
if  the  mode  of  the  predictive  distribution  materializes.  In  case  of  multiple  modes,  the  reward 
is  reduced  proportionally,  that  is, 


S(p,i) 


1  /#  M ( p )  if  i  belongs  to  M ( p ) , 
0  otherwise, 


where  M(p )  =  {*  :  pi  =  maxj=i  denotes  the  set  of  modes  of  p.  This  is  also  known 

as  the  misclassification  loss,  and  the  meteorological  literature  uses  the  term  success  rate  to 
denote  case-averaged  zero-one  scores  (see,  for  example,  Toth,  Zhu  and  Marchok  2001).  The 
associated  expected  score  or  generalized  entropy  function  (5)  is  G(p)  =  rna x.j=i,...,mPj,  and 
the  divergence  function  (6)  becomes 


d(p,  q )  =  max  q3 


J2j£M(p)  Qj 

#M(p) 


This  does  not  define  a  Bregman  distance,  because  the  entropy  function  is  neither  differen¬ 
tiable  nor  strictly  convex. 


The  scoring  rules  in  the  above  examples  are  symmetric,  in  the  sense  that 

S((pi,...,pm),i)  =  S((pni,...,p^m),TTi) 

for  all  p  G  Vm,  for  all  permutations  it  on  m  elements  and  for  all  events  i  = 

Winkler  (1994;  1996,  Section  5)  argued  that  symmetric  rules  do  not  always  appropriately 
reward  forecasting  skill  and  called  for  asymmetric  ones,  particularly  in  situations  in  which 
traditionally  skills  scores  have  been  employed.  Asymmetric  (strictly)  proper  scoring  rules 
can  be  generated  by  applying  Theorem  3.2  to  (strictly)  convex  entropy  functions  G  that 
are  not  invariant  under  coordinate  permutation. 


3.2  Schervish  representation 

The  classical  case  of  yes  or  no  forecasts  for  a  dichotomous  event  suggests  further  discussion. 
We  follow  the  literature  in  considering  the  sample  space  D  =  {1,  0}.  A  probabilistic  forecast 
is  a  quoted  probability  p  G  [0, 1]  for  yes  or  1.  A  scoring  rule  S  can  be  identified  with  a  pair 
of  functions  S(  ■ ,  1)  :  [0, 1]  — >  M  and  S(  ■ ,  0)  :  [0, 1]  — >  M.  Hence,  S(p,  1)  is  the  forecaster’s 
reward  if  she  quotes  p  and  the  event  materializes,  and  S(p,  0)  is  the  reward  if  she  quotes 
p  and  the  event  does  not  materialize.  Note  the  subtle  change  from  the  previous  section, 
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where  we  used  the  convex  class  V-2  =  {(pi,P2)  6  K2  :  Pi  6  [0, 1],P2  =  1  —  Pi}  in  place  of  the 
unit  interval,  V  =  [0, 1],  to  represent  probability  measures  for  binary  events. 

A  scoring  rule  for  binary  variables  is  regular  if  £(  ■ ,  1)  and  £(  ■ ,  0)  are  real- valued,  except 
possibly  that  £(0,1)  =  — oo  or  £(1,0)  =  —  oo.  A  variant  of  Theorem  3.2  shows  that  every 
regular  (strictly)  proper  scoring  rule  is  of  the  form 

S(p,  1)  =  G(p)  +  (1  -  p)  G\p ),  S(p,  0)  =  G(p)  -  pG'(p ),  (10) 

where  G  :  [0, 1]  — >  M  is  a  (strictly)  convex  function  and  G'(p)  is  a  subgradient  of  G  at  the 
point  p  E  [0, 1],  in  the  sense  that 

G{q)  >  G(p)  +  G'(p)(q  —  p) 

for  all  q  E  [0, 1].  The  subgradient  G'(p)  is  real-valued,  except  that  we  permit  G"(0)  =  — oo 
and  G'(l)  =  oo.  If  G  is  differentiable  at  an  interior  point  p  E  (0, 1)  then  G'(p)  is  unique 
and  equals  the  derivative  of  G  at  p.  Related,  but  slightly  less  general  results  were  given  by 
Shuford,  Albert  and  Massengil  (1966). 

The  Savage  representation  (10)  implies  various  interesting  properties  of  regular  (strictly) 
proper  scoring  rules.  For  instance,  we  conclude  from  Theorem  24.2  of  Rockafellar  (1970) 
that 

S(p,  1)  =  lim  G(q)  -  f1  ( G'(q )  -  G'(p))  d q  (11) 

9-*1  Jp 

for  p  E  (0,1),  and  since  G\p)  is  (strictly)  increasing,  S(p,  1)  is  (strictly)  increasing,  too. 
Similarly,  S(p,  0)  is  (strictly)  decreasing,  as  one  intuitively  expects.  Alternative  proofs  of 
these  and  other  results  can  be  found  in  the  appendix  of  Schervish  (1989). 

Schervish  (1989,  p.  1861)  suggested  that  his  Theorem  4.2  generalizes  the  Savage  rep¬ 
resentation.  Given  Savage’s  (1971,  p.  793)  assessment  of  his  representation  (9.15)  as  “fig¬ 
urative,”  the  claim  can  well  be  justified.  However,  in  its  rigorous  form  (10)  the  Savage 
representation  applies  to  a  larger  class  of  scoring  rules  than  that  of  Schervish. 

Theorem  3.7  (Schervish)  Suppose  S  is  a  regular  scoring  rule.  Then  S  is  proper  and 
such  that  5(0, 1)  =  limp^o  S(p,  1)  ,  5(0, 0)  =  limp^o  S{p,  0)  and  both  S(p ,  1)  and  S(p ,  0)  are 
left  continuous  if  and  only  if  there  exists  a  measure  v  on  (0, 1)  such  that 

S(p,l)  =  S(l,l)  -  [  (l-q)u(dq),  S(p,  0)  =  £(0, 0)  -  [  qv{dq),  (12) 

d\p,l)  J[o  ,p) 

for  all  p  E  [0, 1].  The  scoring  rule  is  strictly  proper  if  and  only  if  v  assigns  positive  measure 
to  every  open  interval. 

Proof.  Suppose  £  satisfies  the  assumptions  of  the  theorem.  To  prove  that  S(p,  1)  is  of 
the  form  (12),  consider  the  representation  (11),  identify  the  increasing  function  G'(p)  with 
the  left  continuous  distribution  function  of  a  measure  v  on  (0,1),  and  apply  the  partial 
integration  formula.  The  proof  of  the  representation  for  S(p.  0)  is  analogous.  For  the  proof 
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of  the  converse,  reverse  the  above  steps.  The  statement  for  strict  propriety  follows  from 
well-known  properties  of  convex  functions.  ■ 


Pearl  (1978)  considered  scoring  rules  from  an  economic  perspective,  and  Schervish  (1989) 
proposed  a  general  method  for  comparing  binary  forecasters  within  the  framework  of  two- 
decision  problems.  A  two-decision  problem  can  be  characterized  by  a  cost-loss  ratio  q  £  [0, 1] 
that  reflects  the  relative  costs  of  the  two  possible  types  of  inferior  decision.  The  measure 
zz(d q)  in  the  representation  (12)  assigns  relevance  to  distinct  cost-loss  ratios.  If  the  expected 
score  function,  G,  is  sufficiently  smooth,  then  v(dq)  has  Lebesgue  density  — G"(q )  (Buja  et 
al.  2005).  For  instance,  the  quadratic  or  Brier  score  has  entropy  function  G(p)  =  2p(l  —  p) 
and  corresponds  to  a  uniform  measure.  The  logarithmic  score  derives  from  Shannon  entropy, 
G (p)  =  plogp  +  (1  —  p)  log(l  —  p),  and  corresponds  to  the  infinite  measure  with  Lebesgue 
density  (</(l  —  g))_1.  Buja  et  al.  (2005)  took  this  approach  a  major  step  further.  They 
gave  a  comprehensive  discussion  of  scoring  rules  for  dichotomous  events  and  introduced  a 
parametric  family  of  proper  scoring  rules,  which  includes  the  quadratic  or  Brier  score,  the 
logarithmic  score,  a  scoring  rule  that  underlies  boosting  and  a  left-continuous  version  of  the 
zero-one  score  as  special  cases. 

4  Scoring  rules  for  continuous  variables 

Bremnes  (2004,  p.  346)  noted  that  the  literature  on  scoring  rules  for  probabilistic  forecasts 
of  continuous  variables  is  sparse.  We  address  this  issue  in  the  following. 

4.1  Scoring  rules  for  density  forecasts 

Let  p  be  a  cr-finite  measure  on  the  measurable  space  (fl,4).  For  a  >  1,  let  CQ  denote  the 
class  of  probability  measures  on  (0,4)  that  are  absolutely  continuous  with  respect  to  p 
and  have  /i-density  p  such  that 

Ma  =  (yj  p{u)a  p{du)^j 

is  finite.  We  identify  a  probabilistic  forecast  P  £  Ca  with  its  /.i-density,  p,  and  call  p  a 
predictive  density  or  density  forecast.  Predictive  densities  are  defined  only  up  to  a  set  of  p- 
measure  zero.  Whenever  appropriate,  we  follow  Bernardo  (1979,  p.  689)  and  use  the  unique 
version  defined  by  p(co)  =  lim p^q  P(Sp(lo)) / p(Sp(lo))  where  Sp(lo )  is  a  sphere  of  radius  p 
centered  at  w. 

We  begin  by  discussing  scoring  rules  that  correspond  to  Examples  3.3,  3.4  and  3.5.  The 
quadratic  score, 

QS (p,u})  =  2 p(u)  -  ||p||l, 

is  strictly  proper  relative  to  the  class  L  2-  It  has  expected  score  or  generalized  entropy 
function  G(p)  =  ||p||2>  and  the  associated  divergence  function  is  d(p,q)  =  \\p  —  q Hi-  Good 
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(1971)  proposed  the  pseudo  spherical  score, 

PseudoS  (p,u)  =  p(w)a_1/ |b||a_1, 

that  reduces  to  the  spherical  score  when  a  =  2.  He  described  original  and  generalized 
versions  of  the  score  —  a  distinction  that  in  a  measure-theoretic  framework  is  obsolete. 
The  pseudospherical  score  is  strictly  proper  relative  to  the  class  Ca.  The  strict  convexity 
of  the  associated  entropy  function,  G(p)  =  ||p||a,  and  the  nonnegativity  of  the  divergence 
function  are  straightforward  consequences  of  the  Holder  and  Minkowski  inequalities. 

The  logarithmic  score, 

LogS(p,  u>)  =  log  p(u), 

emerges  as  the  limiting  case  a  — *  1  in  suitably  scaled  pseudospherical  scores.  This  scoring 
rule  was  proposed  by  Good  (1952)  and  has  been  widely  used  since,  sometimes  under  other 
names,  including  the  predictive  deviance  (Knorr-Held  and  Rainer  2001)  and  the  ignorance 
score  (Roulston  and  Smith  2002).  The  logarithmic  score  is  strictly  proper  relative  to  the 
class  C\  of  the  probability  measures  that  are  dominated  by  //.  The  associated  expected  score 
function  or  information  measure  is  negative  Shannon  entropy,  and  the  divergence  function 
becomes  the  classical  Kullback-Leibler  divergence. 

Bernardo  (1979,  p.  689)  argued  that  “when  assessing  the  worthiness  of  a  scientist’s  final 
conclusions,  only  the  probability  he  attaches  to  a  small  interval  containing  the  true  value 
should  be  taken  into  account.”  This  seems  subject  to  debate,  and  atmospheric  scientists 
have  argued  otherwise,  putting  forth  scoring  rules  that  are  sensitive  to  distance  (Epstein 
1969;  Stael  von  Holstein  1970).  That  said,  Bernardo  (1979)  studied  local  scoring  rules 
S(p,lo)  that  depend  on  the  predictive  density  p  only  through  its  value  at  the  event  oj  that 
materializes.  Assuming  regularity  conditions,  he  showed  that  every  proper  local  scoring 
rule  is  of  the  form  S(p,co)  =  alogp(cu)  +  f{uj)  for  some  constant  a  >  0  and  function  /. 
Consequently,  the  linear  score,  LinS(p,  uj)  =  p(co),  is  not  a  proper  scoring  rule,  despite  its 
intuitive  appeal.  For  instance,  let  p  and  u  denote  the  Lebesgue  densities  of  the  standard 
normal  distribution  and  the  uniform  distribution  on  (— e,  e),  respectively.  If  e  <  y/log  2  then 

LinS(u,  p)  =  ±  J_ ^  e~x^2  dx  >  =  LmS(p,  p), 

in  violation  of  propriety.  Essentially,  the  linear  score  encourages  overprediction  at  the  modes 
of  an  assessor’s  true  predictive  density  (Winkler  1969).  The  probability  score  of  Wilson, 
Burrows  and  Lanzinger  (1999)  integrates  the  predictive  density  over  a  neighborhood  of  the 
observed,  real-valued  quantity.  This  resembles  the  linear  score  and  is  not  a  proper  score 
either. 

If  Lebesgue  densities  on  the  real  line  are  used  to  predict  discrete  observations,  the 
logarithmic  score  encourages  the  placement  of  artificially  high  density  ordinates  on  the 
target  values  in  question.  This  problem  emerged  in  the  Evaluating  Predictive  Uncertainty 
Challenge  at  the  PASCAL  Challenges  Workshop  in  Southampton  in  April  2005  and  is 
described  at  www.kyb.tuebingen.mpg.de/bs/people/jqc/southampton.  It  disappears  if 
scores  in  terms  of  predictive  cumulative  distribution  functions  are  used,  or  if  the  sample 
space  is  reduced  to  the  target  values  in  question. 
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4.2  Continuous  ranked  probability  score 

The  restriction  to  predictive  densities  is  frequently  impractical.  Probabilistic  quantita¬ 
tive  precipitation  forecasts,  for  instance,  involve  distributions  with  a  point  mass  at  zero 
(Krzysztofowicz  and  Sigrest  1999;  Bremnes  2004).  This  could  be  handled  by  considering 
densities  with  respect  to  a  mixed  dominating  measure  in  place  of  Lebesgue  measure,  but 
it  seems  more  compelling  to  define  scoring  rules  directly  in  terms  of  predictive  cumulative 
distribution  functions.  Furthermore,  the  aforementioned  scores  are  not  sensitive  to  dis¬ 
tance,  meaning  that  no  credit  is  given  for  assigning  high  probabilities  to  values  near  but 
not  identical  to  the  one  materializing.  Sensitivity  to  distance  seems  particularly  desirable 
when  the  predictive  distributions  tend  to  be  multimodal. 

To  address  this  situation,  let  V  consist  of  the  Borel  probability  measures  on  M.  We 
identify  a  probabilistic  forecast,  that  is,  a  member  of  the  class  V,  with  its  cumulative 
distribution  function  F,  and  we  use  standard  notation  for  the  elements  of  the  sample  space 
M.  Let  1  {y  >  x}  denote  the  function  that  attains  the  value  1  if  y  >  x  and  the  value  0 
otherwise.  The  continuous  ranked  probability  score  is  defined  as 

/OO 

(F(y)  -  l{y  >  x })2  d y  (13) 

-OO 

and  corresponds  to  the  integral  of  the  Brier  scores  for  the  associated  binary  probabilistic 
forecasts  at  all  real- valued  thresholds  (Matheson  and  Winkler  1976;  Hersbach  2000). 

Applications  of  the  continuous  ranked  probability  score  have  been  hampered  by  a  lack 
of  analytic  expressions,  and  the  use  of  numerical  quadrature  rules  for  the  evaluation  of 
(13)  has  been  proposed  instead  (Stael  von  Holstein  1977;  Unger  1985).  However,  analytic 
expressions  can  be  derived  in  some  cases  using  the  following  results.  By  Lemma  2.2  of 
Baringhaus  and  Franz  (2004)  or  identity  (17)  of  Szekely  and  Rizzo  (2005), 

CRPS (F,x)  =  ^Ef\X-X'\  -Ef \X-x\,  (14) 

where  X  and  X’  are  independent  copies  of  a  random  variable  with  distribution  function  F 
and  finite  first  moment.  For  normal  predictive  distributions,  it  follows  readily  that 


where  ip  and  $  denote  the  probability  density  function  and  the  cumulative  distribution 
function  of  a  standard  normal  random  variable,  respectively.  Similarly,  analytical  expres¬ 
sions  can  be  given  for  other  distributions.  If  a  closed  form  expression  is  not  available  but 
random  numbers  with  distribution  F  can  be  generated,  the  right-hand  side  of  (14)  can  be 
evaluated  by  Monte  Carlo  techniques. 

The  continuous  ranked  probability  score  is  proper  relative  to  the  class  V  and  strictly 
proper  relative  to  the  subclass  V\  of  the  Borel  probability  measures  that  have  finite  first 
moment.  The  associated  expected  score  function  or  information  measure, 

r°°  i 

G(F)  =  -  /  F(y)(l-F(y))dy  =  --EF\X-X'\, 

J  —  OO  ^ 
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is  negative  selectivity  (Matheron  1984),  and  the  respective  divergence  function, 

/OO 

(F(y)  -  G(y))2  d y, 

-OO 

is  of  the  Cramer-von  Mises  type. 

The  continuous  ranked  probability  score  has  lately  attracted  renewed  interest  in  the 
atmospheric  sciences  community  (Hersbach  2000;  Candille  and  Talagrand  2005;  Gneiting, 
Raftery,  Westveld  and  Goldman  2005;  Grirnit,  Gneiting,  Berrocal  and  Johnson  2005).  It  is 
typically  used  in  negative  orientation,  say  CRPS*(F,  x)  =  —  CRPS(F,  x).  The  representa¬ 
tion  (14)  can  then  be  written  as 

CRPS*(T»  =  Ef\X-x\  -  ^Ef \X-X'\, 

and  this  sheds  new  light  on  the  score.  In  negative  orientation,  the  continuous  ranked 
probability  score  can  be  reported  in  the  same  unit  as  the  observations,  and  it  generalizes 
the  absolute  error  to  which  it  reduces  if  F  is  a  deterministic  forecast  —  that  is,  a  point 
measure.  Thus,  the  continuous  ranked  probability  score  provides  a  direct  way  of  comparing 
deterministic  and  probabilistic  forecasts. 


4.3  Energy  score 


We  introduce  a  generalization  of  the  continuous  ranked  probability  score  that  draws  on 
Szekely’s  (2003)  statistical  energy  perspective.  Let  Vp,  (3  £  (0,2),  denote  the  class  of  the 
Borel  probability  measures  P  on  Rm  which  are  such  that  Ep||Jf||^  is  finite,  where  ||  •  || 
denotes  the  Euclidean  norm.  We  define  the  energy  score, 


ES(P,  x) 


l-EP\\X-  X'f  -EP\\X-xf, 


(15) 


where  X  and  X'  are  independent  copies  of  a  random  vector  with  distribution  P  £  Vp.  This 
generalizes  the  continuous  ranked  probability  score,  to  which  (15)  reduces  when  (3  =  1  and 
m  =  1,  by  allowing  for  an  index  (3  £  (0,2),  and  by  applying  it  to  distributional  forecasts 
of  a  vector-valued  quantity.  The  evaluation  of  (15)  is  straightforward  if  P  is  discrete, 
and  Monte  Carlo  techniques  can  be  used  otherwise.  The  energy  score  has  the  potentially 
desirable  property  of  invariance  under  joint  translation  and/or  rotation  of  P  and  x.  In 
negative  orientation  it  can  be  interpreted  as  a  generalization  of  the  absolute  error  of  order 
(3.  By  Theorem  1  of  Szekely  (2003),  the  energy  score  is  strictly  proper  relative  to  the  class 
Vg .  For  a  different  and  more  general  argument,  see  Section  5.1  below. 

The  energy  score  with  index  f3  £  (0,2)  applies  to  all  Borel  probability  measures  on 
by  defining 


ES  (P,x)  =  - 


/32^"2r(f  +  f)  f  \cp(y)-e^y)\2 


\\y\\m+y 


d  y, 


(16) 


vrm/2T(l-f)  mm 

where  ip  denotes  the  characteristic  function  of  P.  Essentially,  the  score  computes  a  weighted 
distance  between  the  characteristic  function  of  P  and  the  characteristic  function  of  the 
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point  measure  at  the  value  that  materializes.  This  is  akin  to  the  metric  studied  by  Eaton, 
Giovagnoli  and  Sebastiani  (1996,  p.  124).  If  P  belongs  to  Vp,  Theorem  1  of  Szekely  (2003) 
implies  the  equality  of  the  right-hand  sides  in  (15)  and  (16).  In  the  limiting  case  (3  =  2 , 
the  right-hand  side  of  (15)  reduces  to  the  squared  Euclidean  distance  between  x  and  the 
mean  of  P.  This  score  is  proper  but  not  strictly  proper  relative  to  the  class  V-2  of  the  Borel 
probability  measures  P  for  which  Ep||X||2  is  finite. 

4.4  Predictive  model  choice  criterion 

The  predictive  model  choice  criterion  of  Laud  and  Ibrahim  (1995)  and  Gelfand  and  Ghosh 
(1998)  has  lately  attracted  the  attention  of  the  statistical  community.  Suppose  that  we 
fit  a  predictive  model  to  observed  data  x\, . . .  ,xn.  The  predictive  model  choice  criterion 
(PMCC)  assesses  the  model  fit  through  the  quantity 

n  n 

PMCC  =  £>i-*i)2  +  5>i, 

i=l  i=  1 

where  ji%  and  o\  denote  the  expected  value  and  the  variance,  respectively,  of  a  replicate 
variable  Xt,  given  the  model  and  the  observations.  Within  the  framework  of  scoring  rules, 
the  PMCC  corresponds  to  the  positively  oriented  score 

S(F,x)  =  -  ( EfX  -  x)2  -  XarF{X), 

where  X  is  a  random  variable  with  distribution  F  and  finite  variance.  This  is  not  a  proper 
scoring  rule:  if  the  forecaster’s  true  belief  is  F  and  if  she  wishes  to  maximize  the  expected 
score,  she  will  quote  the  point  measure  at  EpX  —  that  is,  a  deterministic  forecast  —  rather 
than  the  predictive  distribution  F. 

One  might  also  be  tempted  to  consider  an  alternative  to  the  continuous  ranked  proba¬ 
bility  score  (13)  in  terms  of  E1-1,  say 

S(F,x)  =  ~j*  (E~V)  -  a:)2  d u=  -EF(X-x)2. 

This  relates  to  the  Mallows  (1972)  distance  and  the  Wasserstein  metric  and  does  not  define 
a  proper  scoring  rule  either,  with  a  hedging  strategy  that  is  identical  to  the  above. 

5  Proper  scoring  rules,  negative  definite  functions  and  in¬ 
equalities  of  Hoeffding  type 

In  this  section  we  employ  negative  definite  functions  to  construct  proper  scoring  rules,  and 
we  present  expectation  inequalities  that  are  of  independent  interest. 


15 


5.1  Scoring  rules  associated  with  negative  definite  kernels 

Let  12  be  a  nonempty  set.  A  real- valued  function  g  on  12  x  12  is  said  to  be  a  negative  definite 
kernel  if  it  is  symmetric  in  its  arguments  and  Ya=i  X)j=i  aiaj  6{xii  xj)  <  0  for  all  positive 
integers  n,  all  a\, . . . ,  an  G  M  that  sum  to  zero,  and  all  x\, . . . ,  xn  G  12.  Numerous  examples 
of  negative  definite  kernels  can  be  found  in  Berg,  Christensen  and  Ressel  (1984)  and  the 
references  therein. 

We  now  give  the  key  result  in  this  section. 

Theorem  5.1  Let  12  be  a  Hausdorfif  space  and  let  g  be  a  nonnegative,  continuous  nega¬ 
tive  definite  kernel  on  12  x  12.  For  a  Borel  probability  measure  P  on  12,  let  X  and  X'  be 
independent  random  variables  with  distribution  P.  Then  the  scoring  rule 

S(P ,  x)=X-EP  g(X,  X')  -  EP  g(X,  x)  (17) 

is  proper  relative  to  the  class  of  the  Borel  probability  measures  P  on  12  for  which  the  expec¬ 
tation  Ep  g(X,  X')  is  finite. 

Proof.  Let  P  and  Q  be  Borel  probability  measures  on  12,  and  suppose  that  X,  X'  and 
Y,  Y'  are  independent  random  variates  with  distribution  P  and  Q,  respectively.  We  need 
to  show  that 

l-EQg(Y,Yr)  >  i  EPg(X,X ')  -  EPiQg{X,Y).  (18) 

If  the  expectation  Epq  g(X,  Y)  is  infinite,  the  inequality  is  trivially  satisfied;  if  it  is  finite, 
Theorem  2.1  in  Berg,  Christensen  and  Ressel  (1984,  p.  235)  implies  (18).  ■ 

Next  we  give  examples  of  scoring  rules  that  admit  the  representation  in  Theorem  5.1. 
In  each  case,  we  equip  the  sample  space  with  the  standard  topology.  Note  that  scores  of 
the  type  (17)  are  straightforward  to  evaluate  if  P  is  discrete  and  has  a  moderate  number 
of  atoms  only,  as  is  typically  true  for  ensemble  forecasts  and  Markov  chain  Monte  Carlo 
samples. 

Example  5.2  (quadratic  or  Brier  score)  Let  12  =  {0,1}  and  suppose  that  <7(0,0)  = 
p(l,  1)  =  0  and  <7(0, 1)  =  <7(1,0)  =  2.  Then  (17)  recovers  the  quadratic  or  Brier  score. 

Example  5.3  (continuous  ranked  probability  score)  If  12  =  M  and  g(x,x')  =  \x  —  x'\ 

for  x,x'  G  M  in  Theorem  5.1,  we  obtain  the  continuous  ranked  probability  score  (14). 

Example  5.4  (energy  score)  If  12  =  Mm,  /3  G  (0,2)  and  g(x,xr)  =  \\x  —  x'\\13  for  x,  x'  G 
Rm,  Theorem  5.1  recovers  the  energy  score  (15). 

Example  5.5  (continuous  ranked  probability  score  for  circular  variables)  We  let 

12  =  §  denote  the  circle,  and  write  a(6, 6')  for  the  angular  distance  between  two  points 
6,6'  G  S.  Let  P  be  a  Borel  probability  measure  on  S,  and  let  0  and  0'  be  independent 
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random  variates  with  distribution  P.  By  Theorem  1  of  Gneiting  (1998),  angular  distance 
is  a  negative  definite  kernel.  Hence, 

S{P,9)  =  ^Epa(0,Q')  -  EPa{Q,9)  (19) 

defines  a  proper  scoring  rule  relative  to  the  class  of  the  Borel  probability  measures  on 
the  circle.  Grirnit  et  al.  (2005)  introduced  (19)  as  an  analogue  of  the  continuous  ranked 
probability  score  (14)  that  applies  to  directional  variables,  and  used  Fourier  analytic  tools 
to  prove  the  propriety  of  the  score. 

We  now  turn  to  a  far-reaching  generalization  of  the  energy  score.  For  x  =  (x\, . . . ,  xm)  £ 
Mm  and  a  £  (0,  oo]  define  ||x||a  =  (X)£i  \xi\a)l^a  if  a  £  (0,  oo)  and  ||x||a  =  maxi<j<m  \xi\ 
if  a  =  oo.  Schoenberg’s  theorem  (Berg,  Christensen  and  Ressel  1984,  p.  74)  and  a  strand 
of  literature  culminating  in  the  work  of  Koldobskii  (1992)  and  Zastavnyi  (1993)  imply  that 
if  a  £  (0,  oo]  and  (3  >  0,  then  the  kernel 

g(x,x')  =  \\x  —  x'W^ ,  x,x'  £  Mm, 

is  negative  definite  if  and  only  if  the  following  holds. 

Assumption  5.6  Suppose  that  either  (i)  m  =  1,  a  £  (0,  oo]  and  (3  £  (0,2];  or  (ii)  m  >  2, 
a  £  (0,2]  and  (3  £  (0,a];  or  (iii)  m  =  2,  a  £  (2, oo]  and  f3  £  (0, 1]. 

Example  5.7  (non-Euclidean  energy  score)  Under  Assumption  5.6,  the  scoring  rule 

S(P,x)  =  ^EP\\X  -  X'fa-  EP\\X  -  xfa 

is  proper  relative  to  the  class  of  the  Borel  probability  measures  P  on  Mm  for  which  the 
expectation  Ep  || X  —  X'\\^  is  finite.  If  m  =  1  or  a  =  2,  we  recover  the  energy  score;  if  m  >  2 
and  a  /  2,  we  obtain  non-Euclidean  analogues.  Section  5.2  of  Mattner  (1997)  shows  that 
if  a  >  1  then  Ep^q\\X  —  Y\\^  is  finite  if  and  only  if  Ep||X||^  and  EgHE]]^  are  such.  In 
particular,  if  a  >  1  then  Ep\\X  —  X'W1^  is  finite  if  and  only  if  Ep||A||^  is  finite. 

The  following  result  sharpens  Theorem  5.1  in  the  crucial  case  of  Euclidean  sample 
spaces  and  spherically  symmetric  negative  definite  functions.  Recall  that  a  function  r]  on 
(0,oo)  is  said  to  be  completely  monotone  if  it  possesses  derivatives  of  all  orders  and 
(— l)fc  i](k\t)  >  0  for  all  nonnegative  integers  k  and  all  t  >  0. 

Theorem  5.8  Let  ^  be  a  continuous  function  on  [0,  oo)  with  —if1  completely  monotone 
and  not  constant.  For  a  Borel  probability  measure  P  on  Mm,  let  X  and  X'  be  independent 
random  vectors  with  distribution  P.  Then  the  scoring  rule 

S(P,x)  =  ^EPil)(\\X  -  X'\\l)  -  Ep'ifiWX  -  x\\l) 

is  strictly  proper  relative  to  the  class  of  the  Borel  probability  measures  P  on  Mm  for  which 
E p  f)(\\X  —  X' W2)  is  finite. 
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The  proof  of  this  result  is  immediate  from  Theorem  2.2  of  Mattner  (1997).  In  particular, 
if  fi(t)  =  for  (3  £  (0,2),  Theorem  5.8  assures  the  strict  propriety  of  the  energy  score 
relative  to  the  class  of  the  Borel  probability  measures  P  on  Rm  for  which  -EpH-X'Ilf  is  finite. 

5.2  Inequalities  of  Hoeffding  type  and  positive  definite  kernels 

A  number  of  side  results  seem  of  independent  interest,  even  though  they  are  easy  conse¬ 
quences  of  previous  work. 

Briefly,  if  the  expectations  Ep  g(X,X')  and  Ep  g(Y,Y')  are  finite,  then  (18)  can  be 
written  as  a  Hoeffding  type  inequality, 

2  Ep.q  g(X ,  Y)  -  EP  g(X ,  X')  -  EQ  g(Y ,  Y')  >  0.  (20) 

See  Theorem  1  of  Szekely  and  Rizzo  (2005)  for  a  nearly  identical  result  and  a  converse:  if  g 
is  not  negative  definite,  then  there  are  counterexamples  to  (20),  and  the  respective  scoring 
rule  is  improper.  If  furthermore  Q  is  a  group,  and  the  negative  definite  function  g  satisfies 
g(x,x ')  =  g(—x,  —x')  for  x,x'  £  H,  a  special  case  of  (20)  can  be  stated  as 

Epg(X,-X')>EPg(X,X').  (21) 

In  particular,  if  Q  =  Mm  and  Assumption  5.6  holds,  inequalities  (20)  and  (21)  apply  and 
reduce  to 

2E\\X-  Y \\i  -  E  || A  -  X'\\P  -  E  \\Y  -  Y'fa  >  0  (22) 

and 

E\\X-  X'fa  <  E\\X  +  X'fa,  (23) 

respectively,  thereby  generalizing  results  in  Buja,  Logan,  Reeds  and  Shepp  (1994),  Szekely 
(2003)  and  Baringhaus  and  Franz  (2004). 

In  the  above  case  in  which  H  is  a  group  and  g  satisfies  g(x,  x')  =  g(—x,  —x')  for  x,  x'  £  fl, 
the  argument  leading  to  Theorem  2.3  of  Buja  et  al.  (1994)  and  Theorem  4  of  Ma  (2003) 
implies  that 

h(x,  x')  =  g(x,  —  x')  —  g(x,  x'),  x,x'e£1,  (24) 

is  a  positive  definite  kernel ,  in  the  sense  that  h  is  symmetric  in  its  arguments  and  Y^i=i  i 

didj  h(xi,Xj)  >  0  for  all  positive  integers  n,  all  ai  ,...,an  £  M,  and  all  xi,...,xn  £  IL 
Specifically,  under  Assumption  5.6, 

h(x,x')  =  \\x  +  x'\\P,  —  \\x  —  x'\\P,,  x,x'  £  Mm,  (25) 

is  a  positive  definite  kernel,  and  this  extends  and  completes  the  aforementioned  theorem  of 

Buja  et  al.  (1994). 

6  Scoring  rules  for  quantile  and  interval  forecasts 

Occasionally,  full  predictive  distributions  are  difficult  to  specify,  and  the  forecaster  might 
quote  predictive  quantiles  or  prediction  intervals  instead.  Christoffersen  (1998)  and  Bremnes 
(2004)  gave  examples  of  this  type  of  situation. 
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6.1  Proper  scoring  rules  for  quantiles 

We  consider  probabilistic  forecasts  of  a  continuous  quantity  that  take  the  form  of  predictive 
quantiles.  Specifically,  suppose  that  the  quantiles  at  the  levels  ai, . . . ,  a*,  G  (0, 1)  are  sought. 
If  the  forecaster  quotes  the  quantiles  ri, . . . ,  rk  and  x  materializes,  she  will  be  rewarded  by 
the  score  S(r±, . . . ,  rk]  x).  We  define 

S(ri,...,rk]P)  =  J S(ri,...,rk-,x)  d P(x) 

as  the  expected  score  under  the  probability  measure  P  when  the  forecaster  quotes  the 
quantiles  ri, . . .  ,rk.  To  avoid  technical  complications,  we  suppose  that  P  belongs  to  the 
convex  class  V  of  Borel  probability  measures  on  R  that  have  finite  moments  of  all  orders 
and  whose  distribution  function  is  strictly  increasing  on  R.  For  P  G  V,  let  q\, . . . ,  qk  denote 
the  true  P-quantiles  at  levels  a±, . . . ,  ak.  Following  Cervera  and  Munoz  (1996),  we  say  that 
a  scoring  rule  S  is  proper  if 

S(qi,...,qk;P)  >  S(ri,...,rk;P) 

for  all  real  numbers  r\,...,rk  and  for  all  probability  measures  P  G  V.  If  S'  is  proper,  the 
forecaster  who  wishes  to  maximize  the  expected  score  is  encouraged  to  be  honest  and  to 
volunteer  her  true  beliefs. 

To  avoid  technical  overhead,  we  tacitly  assume  P-integrability  whenever  appropriate. 
Essentially,  we  require  that  the  functions  s{x)  and  h{x )  in  (26)  and  (28)  be  P-rneasurable 
and  grow  at  most  polynomially  in  x.  We  write  l{x  <  r}  for  the  function  that  takes  the 
value  1  if  x  <  r  and  the  value  0  otherwise.  Theorem  6.1  addresses  the  prediction  of  a  single 
quantile;  Corollary  6.2  turns  to  the  general  case. 

Theorem  6.1  If  s  is  nondecreasing  and  h  is  arbitrary,  the  scoring  rule 

S(r ;  x)  =  as(r)  +  (s(x)  —  s(r))  l{x  <  r}  +  h(x )  (26) 

is  proper  for  predicting  the  quantile  at  level  a  G  (0, 1). 

Proof.  Let  q  be  the  unique  a-quantile  of  the  probability  measure  FeP.  We  identify  P 
with  the  associated  distribution  function  so  that  P(q)  =  a.  If  r  <  q  then 

S(q;  P)  —  S(r;  P)  =  f  s(x)  dP(x)  +  s(r)P(r)  —  as(r) 

'  { r,q ) 

>  s(r)(P(q)  —  P(r))  +  s(r)P(r)  —  as(r)  =  0, 

as  desired.  If  r  >  q  an  analogous  argument  applies.  ■ 

If  s(x)  =  x  and  h{x)  =  —ax,  we  obtain  the  scoring  rule 

5(r;  x)  =  (x  —  r)  (l{x  >  r}  —  a),  (27) 

which  was  proposed  by  Koenker  and  Machado  (1999),  Taylor  (1999)  and  Theis  (2005, 
p.  232)  for  measuring  in-sample  goodness-of-fit  and  out-of-sample  forecast  performance, 
respectively. 
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Corollary  6.2  If  Si  is  nondecreasing  for  i  =  1, . . . ,  k  and  h  is  arbitrary,  the  scoring  rule 

k 

S(r i,. .  .,rk;x)  =  ^  (ajS;(r)  +  (si(x)  -  s^r*))  l{x  <  rj)  +  h(x)  (28) 

i= 1 

is  proper  for  predicting  the  quantiles  at  levels  aq, . . . ,  ak  €  (0, 1). 

Cervera  and  Munoz  (1996,  pp.  515  and  519)  proved  Corollary  6.2  in  the  special  case  in 
which  each  sq  is  linear.  They  asked  whether  the  resulting  rules  are  the  only  proper  ones  for 
quantiles.  Our  results  give  a  negative  answer;  that  is,  the  class  of  proper  scoring  rules  for 
quantiles  is  considerably  larger  than  anticipated  by  Cervera  and  Munoz.  We  do  not  know 
whether  or  not  (26)  and  (28)  provide  the  general  form  of  proper  scoring  rules  for  quantiles. 

6.2  Interval  score 

Interval  forecasts  form  a  crucial  special  case  of  quantile  prediction.  We  consider  the  classical 
case  of  the  central  (1  —  a)  x  100%  prediction  interval,  whose  lower  and  upper  endpoints 
are  given  by  the  predictive  quantile  at  level  j  and  1  —  f  •  We  denote  a  scoring  rule  for  the 
associated  interval  forecast  by  Sa(l,u;x),  where  l  and  u  stand  for  the  quoted  j  and  1  —  f 
quantile,  respectively.  Hence,  if  the  forecaster  quotes  the  (1  —  a)  x  100%  central  prediction 
interval  [l,u\  and  x  materializes,  her  score  will  be  Sa(l,u;x).  Putting  aq  =  j,  ct2  =  1  —  §, 
si(x)  =  s 2(2:)  =  2)|  and  h(x)  =  —  f-  in  (28)  yields  the  interval  score , 

S^fl,  u;x)  =  —  ( (u  —  l)  H —  (l  —  x)  l{x  <  /}  H —  (x  —  u)  l{x  >  .  (29) 

V  a  a  ) 

This  scoring  rule  has  intuitive  appeal  and  —  in  the  form  of  a  utility  function  —  can  be 
traced  back  at  least  to  Dunsmore  (1968)  and  Winkler  (1972).  The  forecaster  is  rewarded 
for  narrow  prediction  intervals,  and  she  avoids  an  additional  penalty  whose  size  depends  on 
a  if  the  interval  covers  the  observation.  In  the  particular  case  a  =  0.50,  Harnill  and  Wilks 
(1995,  p.  622)  used  a  score  that  is  negatively  oriented  yet  equivalent  to  the  interval  score. 
They  noted  that  “a  strategy  for  gaming  [. . .  ]  was  not  obvious”  which  is  confirmed  by  the 
propriety  of  the  score. 

6.3  Prediction  intervals  for  a  conditionally  heteroscedastic  process 

Kabaila  (1999)  called  for  rigorous  ways  of  specifying  prediction  intervals  for  conditionally 
heteroscedastic  processes  and  proposed  a  relevance  criterion  in  terms  of  conditional  coverage 
and  width  dependence.  We  contend  that  the  notion  of  proper  scoring  rules  provides  a 
simpler,  more  general  and  more  rigorous  paradigm.  The  prediction  intervals  that  we  deem 
appropriate  derive  from  the  true  conditional  distribution,  as  implied  by  the  data  generating 
mechanism,  and  thereby  maximize  the  expected  value  of  all  proper  scores. 

To  fix  the  idea,  consider  the  stationary  bilinear  process  {Xt  :  t  £  Z}  defined  by 

Xt.+ 1  =  —  Xt  +  —  -Yt  e*  +  ( t,  (30) 
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where  the  et  are  independent  standard  normal  random  variates.  Kabaila  and  He  (2001) 
studied  central  one-step  ahead  prediction  intervals  at  the  95%  level.  The  process  is  Marko¬ 
vian,  and  the  conditional  distribution  of  Xt+\  given  Xt,Xt- 1,...  is  Gaussian  with  mean 
\ Xt  and  variance  (1  +  | Xt )2,  thereby  suggesting  the  prediction  interval 

1=  \xt  -  c  1  +  ±Xt  ,  ±Xt  +  c  1  +  ±Xt  ,  (31) 

where  c  =  4>_1(0.975).  This  interval  satisfies  the  relevance  property  of  Kabaila  (1999), 
and  Kabaila  and  He  (2001)  adopted  I  as  the  standard  prediction  interval.  We  agree  with 
this  choice,  but  we  prefer  the  aforementioned  more  direct  justification:  the  prediction  in¬ 
terval  /  is  the  standard  interval  because  its  lower  and  upper  endpoints  are  the  2.5%  and 
97.5%  percentiles  of  the  true  conditional  distribution  function,  respectively.  Kabaila  and 
He  considered  two  alternative  prediction  intervals,  namely 

J=  IK"1  (0.025),  F"1  (0.975)1,  (32) 

where  F  denotes  the  unconditional,  stationary  distribution  function  of  the  Xt,  and 

K  =  [\Xt  "  ^(l1  +  \Xi\)  \X*  +  ^(l1  +  ^|)]  >  (33) 

where 

7(„)  =  J  (21o«  (w6))172*'  if  0<S<7.36, 

{  0  if  y  >  7.36. 

This  choice  of  7  minimizes  the  expected  width  of  the  prediction  interval  under  the  constraint 
of  nominal  coverage.  However,  the  interval  forecast  (33)  seems  misguided;  it  collapses  to  a 
point  forecast  when  the  conditional  predictive  variance  is  highest. 

We  generated  a  sample  path  of  length  100  001  from  the  bilinear  process  (30)  and  consid¬ 
ered  the  interval  forecasts  (31),  (32)  and  (33),  respectively.  Table  1  summarizes  the  results 
of  this  experiment.  All  three  interval  forecasts  showed  close  to  nominal  coverage,  and  the 
prediction  interval  (33)  showed  the  smallest  average  width.  Nevertheless,  the  classical  pre¬ 
diction  interval  (31)  performed  best  in  terms  of  the  interval  score. 

6.4  Scoring  rules  for  distributional  forecasts 

Specifying  a  predictive  cumulative  distribution  function  is  equivalent  to  specifying  all  pre¬ 
dictive  quantiles;  hence,  one  can  build  scoring  rules  for  predictive  distributions  from  scoring 
rules  for  quantiles.  Matheson  and  Winkler  (1976)  and  Cervera  and  Munoz  (1996)  suggested 
ways  of  doing  this.  In  particular,  if  Sa  denotes  a  proper  scoring  rule  for  the  quantile  at 
level  a  and  v  is  a  Borel  measure  on  (0, 1),  then  the  scoring  rule 

S(F,x)  =  [  S’q(F_1(o:);  x)  z%da)  (34) 

Jo 
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Table  1:  One-step  ahead  95%  prediction  intervals  for  the  stationary  bilinear  process  (30). 
Results  of  a  simulation  study  using  100  000  interval  forecasts  each. 


Interval 

Forecast 

Empirical 

Coverage 

Average 

Width 

Average 
Interval  Score 

I 

(31) 

95.01% 

4.00 

-  9.55 

J 

(32) 

95.08% 

5.45 

-16.09 

K 

(33) 

94.98% 

3.79 

-10.64 

is  proper,  subject  to  regularity  and  integrability  constraints. 

Similarly,  one  can  build  scoring  rules  for  predictive  distributions  from  scoring  rules  for 
binary  probability  forecasts.  If  S  denotes  a  proper  scoring  rule  for  probability  forecasts  and 
v  is  a  Borel  measure  on  K,  then  the  scoring  rule 

/OO 

S(F(y),l{y>x})v(dy)  (35) 

-OO 

is  proper,  subject  to  integrability  constraints  (Matheson  and  Winkler  1976;  Gerds  2002). 
The  continuous  ranked  probability  score  (13)  corresponds  to  the  special  case  in  (35)  in  which 
S  is  the  quadratic  or  Brier  score  and  v  is  Lebesgue  measure.  This  construction  carries  over 
to  the  multivariate  case.  If  V  denotes  the  class  of  the  Borel  probability  measures  on  Rm, 
we  identify  a  probabilistic  forecast  P  £  V  with  its  cumulative  distribution  function  F.  A 
multivariate  analogue  of  the  continuous  ranked  probability  score  can  be  defined  as 


CRPS(F,  x) 


(F(V)  ~  1{»  >  U)2 


u(dy). 


This  is  a  weighted  integral  of  the  Brier  scores  at  all  to- variate  thresholds,  and  the  Borel 
measure  v  can  be  chosen  to  encourage  the  forecaster  to  concentrate  her  efforts  on  the 
important  ones.  If  v  is  finite  and  dominates  Lebesgue  measure,  this  score  is  strictly  proper 
relative  to  the  class  V. 


7  Scoring  rules,  Bayes  factors  and  random-fold  cross-validation 

We  now  relate  proper  scoring  rules  to  Bayes  factors  and  to  cross-validation,  and  propose  a 
novel  form  of  cross-validation,  random-fold  cross-validated  likelihood. 

7.1  Logarithmic  score  and  Bayes  factors 

Probabilistic  forecasting  rules  are  often  generated  by  probabilistic  models,  and  the  standard 
Bayesian  approach  to  comparing  probabilistic  models  is  by  Bayes  factors.  Suppose  we  have  a 
sample  X  =  (Xi, . . . ,  Xn)  of  values  to  be  forecast.  Suppose  also  that  we  have  two  forecasting 


22 


rules,  based  on  probabilistic  models  H\  and  H2.  So  far  in  this  paper  we  have  concentrated 
on  the  situation  where  the  forecasting  rule  is  completely  specified  before  any  of  the  Xj  is 
observed,  that  is,  there  are  no  parameters  to  be  estimated  from  the  data  being  forecast.  In 
that  situation,  the  Bayes  factor  for  Hi  against  H2  is 

_  P{X\H{) 
p{x\H2y 

where  P(X\Hk)  =  J~["=i  P{Xi\Hk)  {k  =  1,2)  (Jeffreys  1939;  Kass  and  Raftery  1995). 

Thus  if  the  logarithmic  score  is  used,  the  log  Bayes  factor  is  the  difference  of  the  scores 
for  the  two  models, 

logR  =  LogS  (HUX) -LogS  (H2,X).  (37) 

This  was  pointed  out  by  Good  (1952),  who  called  the  log  Bayes  factor  the  weight  of  evidence. 
It  establishes  two  connections.  First,  the  Bayes  factor  is  equivalent  to  the  logarithmic  score 
in  this  no-paranreter  case.  Second,  it  shows  that  the  Bayes  factor  applies  more  generally 
than  just  to  the  comparison  of  parametric  probabilistic  models,  but  also  to  the  comparison 
of  probabilistic  forecasting  rules  of  any  kind. 

So  far  in  this  paper  we  have  taken  probabilistic  forecasts  to  be  fully  specified,  but  often 
they  are  specified  only  up  to  unknown  parameters  estimated  from  the  data.  Now  suppose 
that  the  forecasting  rules  considered  are  specified  only  up  to  unknown  parameters,  6 k  for 
Hk,  to  be  estimated  from  the  data.  Then  the  Bayes  factor  is  still  given  by  (36),  but  now 
P(X\Hk)  is  the  integrated  likelihood , 


(36) 


P(X\Hk)  =  J p(X\9k,Hk)p(9k\Hk)  d ek, 

where  p(X\9k,  Hk)  is  the  (usual)  likelihood  under  model  Hk  and  p(9k\Hk)  is  the  prior  dis¬ 
tribution  of  the  parameter  9k. 

Dawid  (1984)  showed  that  when  the  data  come  in  a  particular  order,  such  as  time  order, 
the  integrated  likelihood  can  be  reformulated  in  predictive  terms: 


P(X\Hk)  =  l[P(Xt\Xt-\Hk), 

t= l 


(38) 


where  X*  1  =  {Xi, . . . ,  Xt-i},  and  P(Xt\Xt  1,Hk)  is  the  predictive  distribution  of  Xt 
given  the  past  values  under  Hk,  namely 

PiXtlX*-1,^)  =  J  p(Xt\9k,Hk)P(9k\Xt-\Hk)d9k, 

with  P(9k\Xt~1 ,  Hk)  being  the  posterior  distribution  of  9k  given  the  past  observations  X^1. 

Let  us  denote  by  Skts  the  log  integrated  likelihood,  viewed  now  as  a  scoring  rule.  It 
helps  to  view  it  as  a  scoring  rule  to  rewrite  it  as 


Sk,B  =  ^ogP(Xt\Xt-\Hk). 

t= l 
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Dawid  (1984)  showed  that  Sk.B  is  asymptotically  equivalent  to  the  plug-in  maximum  like¬ 
lihood  prequential  score 

n 

Sk,D  =  ^2logP(Xt\Xt-1,§tk-1),  (39) 

t=  1 

where  0^.  1  is  the  maximum  likelihood  estimator  (MLE)  of  9k  based  on  the  past  observa¬ 
tions,  X <_1,  in  the  sense  that  Sk,D/Sk,B  -+  1  as  n  ->  oo.  He  also  showed  that  Sk,B  is 
asymptotically  equivalent  to  the  BIC  score, 

Sk,Bic  =  jriogP(Xt\Xt-\§Z)  -  ^  log  n, 

t=  l  z 

where  dk  =  dinr(0fc),  in  the  same  sense,  namely  <S'fc,Bic/5fc,B  1  as  n  ->  oo.  This  justi¬ 
fies  the  use  of  BIC  for  comparing  forecasting  rules,  extending  the  previous  justification  of 
Schwarz  (1978),  which  related  only  to  comparing  models. 

These  results  have  two  limitations,  however.  First,  they  assume  that  the  data  come 
in  a  particular  order.  Second,  they  use  only  the  logarithmic  score,  and  not  other  scores 
that  might  be  more  appropriate  for  the  task  at  hand.  We  now  briefly  consider  how  these 
limitations  might  be  addressed. 

7.2  Scoring  rules  and  random-fold  cross-validation 

Suppose  now  that  the  data  are  unordered.  We  can  replace  (38)  by 

n 

St,B  =  ^ED[logp(Xt\x(D\Hk],  (40) 

t=  l 

where  D  is  a  random  sample  from  {1, . . . ,  f  —  l,f  +  l,...,n},  whose  size  is  a  random  variable 
that  has  a  discrete  uniform  distribution  on  {0, 1 ,n  —  1}.  Dawid’s  result  (39)  implies 
that  this  is  asymptotically  equivalent  to  the  plug-in  maximum  likelihood  version, 

n 

S*k,D  =  Y,ED[logp(Xt\X(D\d<f\Hk],  (41) 

t=  l 

where  6 ^  is  the  MLE  of  0k  based  on  X^D\ 

The  formulations  (40)  and  (41)  may  be  useful  because  they  turn  a  score  that  was  a 
sum  of  non-identically  distributed  terms  into  one  that  is  a  sum  of  identically  distributed 
exchangeable  terms.  This  opens  the  possibility  of  evaluating  S £  B  or  5|  by  Monte  Carlo, 
which  would  be  a  form  of  cross-validation.  In  this  cross-validation,  the  amount  of  data  left 
out  would  be  random  rather  than  fixed,  leading  us  to  call  it  random-fold,  cross-validation. 
Smyth  (2000)  used  the  log-likelihood  as  the  criterion  function  in  cross-validation,  as  here, 
calling  the  resulting  method  cross- validated  likelihood,  but  used  a  fixed  holdout  sample  size. 
This  general  approach  can  be  traced  back  at  least  to  Geisser  and  Eddy  (1975).  One  issue 
in  cross-validation  generally  is  how  much  data  to  leave  out,  and  different  choices  lead  to 
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different  versions  of  cross-validation,  such  as  leave-one-out,  10-fold,  and  so  on.  Considering 
versions  of  cross-validation  in  the  context  of  scoring  rules  may  shed  some  light  on  this  issue. 

We  have  seen  by  (37)  that  when  there  are  no  parameters  being  estimated,  the  Bayes 
factor  is  equivalent  to  the  difference  in  the  logarithmic  score.  Thus  one  could  replace  the 
logarithmic  score  by  another  proper  score,  and  the  difference  in  scores  could  be  viewed  as  a 
kind  of  predictive  Bayes  factor  with  a  different  type  of  score.  In  Sk,B,  Sk,D ,  S^biC;  b->  and 
S%  D  ’  we  could  replace  the  terms  in  the  sums  (each  of  which  has  the  form  of  a  logarithmic 
score)  by  another  proper  scoring  rule,  such  as  the  continuous  ranked  probability  score,  and 
we  conjecture  that  similar  asymptotic  equivalences  remain  valid. 

8  Case  study:  Probabilistic  forecasts  of  sea-level  pressure 
over  the  North  American  Pacific  Northwest 

Our  goals  in  this  case  study  are  to  illustrate  the  use  and  the  properties  of  scoring  rules  and 
to  demonstrate  the  importance  of  propriety. 

8.1  Probabilistic  weather  forecasting  using  ensembles 

Operational  probabilistic  weather  forecasts  are  based  on  ensemble  prediction  systems.  En¬ 
semble  systems  typically  generate  a  set  of  perturbations  of  the  best  estimate  of  the  current 
state  of  the  atmosphere,  run  each  of  them  forward  in  time  using  a  numerical  weather  predic¬ 
tion  model,  and  use  the  resulting  set  of  forecasts  as  a  sample  from  the  predictive  distribution 
of  future  weather  quantities  (Palmer  2002;  Gneiting  and  Raftery  2005). 

Grirnit  and  Mass  (2002)  described  the  University  of  Washington  ensemble  prediction 
system  over  the  Pacific  Northwest  which  covers  Oregon,  Washington,  British  Columbia, 
and  parts  of  the  Pacific  Ocean.  This  is  a  five-member  ensemble  that  consists  of  distinct 
runs  of  the  MM5  numerical  weather  prediction  model  with  initial  conditions  taken  from 
distinct  national  and  international  weather  centers.  We  consider  48-hour  ahead  forecasts 
of  sea-level  pressure  in  January-June  2000,  the  same  period  as  that  on  which  the  work  of 
Grirnit  and  Mass  was  based.  The  unit  used  is  the  millibar  (mb).  Our  analysis  builds  on  a 
verification  data  base  of  16  015  records  scattered  over  the  North  American  Pacific  North¬ 
west  and  the  aforementioned  six-month  period.  Each  record  consists  of  the  five  ensemble 
member  forecasts  and  the  associated  verifying  observation.  The  root-mean-square  error  of 
the  ensemble  mean  forecast  was  3.30  mb,  and  the  square  root  of  the  average  variance  of  the 
five-member  forecast  ensemble  was  2.13  mb,  resulting  in  a  ratio  of  1.55. 

This  underdispersive  behavior  —  that  is,  observed  errors  that  tend  to  be  larger  on 
average  than  suggested  by  the  ensemble  spread  —  is  typical  of  ensemble  systems  and  seems 
unavoidable,  given  that  ensembles  capture  only  some  of  the  sources  of  uncertainty  (Raftery, 
Gneiting,  Balabdaoui  and  Polakowski  2005).  To  obtain  calibrated  predictive  probability 
distributions,  it  thus  seems  necessary  to  carry  out  some  form  of  statistical  postprocessing. 
One  natural  approach  is  to  take  the  predictive  distribution  for  sea-level  pressure  at  any 
given  site  as  normal,  centered  at  the  ensemble  mean  forecast,  and  with  predictive  standard 
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Figure  1:  Probabilistic  sea- level  pressure  forecasts  over  the  North  American  Pacific  North¬ 
west  in  January- July  2000.  The  scores  are  shown  as  a  function  of  the  inflation  factor  r, 
where  the  predictive  density  is  taken  to  be  normal,  centered  at  the  ensemble  mean  fore¬ 
cast,  and  with  predictive  standard  deviation  equal  to  r  times  the  standard  deviation  of  the 
forecast  ensemble.  The  scores  were  subject  to  linear  transformations  as  detailed  in  Table  2. 


deviation  equal  to  r  times  the  standard  deviation  of  the  forecast  ensemble.  Density  forecasts 
of  this  type  were  proposed  by  Deque,  Royer  and  Stroe  (1994)  and  Wilks  (2002).  Following 
Wilks,  we  refer  to  r  as  an  inflation  factor. 

8.2  Evaluation  of  density  forecasts 

In  the  aforementioned  approach  the  predictive  density  is  Gaussian,  say  c p^rc '■  its  mean, 
/r,  is  the  ensemble  mean  forecast,  and  its  standard  deviation,  ra,  is  the  product  of  the 
inflation  factor,  r,  and  the  standard  deviation  of  the  five-member  forecast  ensemble,  a.  We 
considered  various  scoring  rules  S  and  computed  the  average  score, 

1  16  015 

s(r)  =  SiViHMiZi),  r>  0,  (42) 
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Table  2:  Probabilistic  sea-level  pressure  forecasts  over  the  North  American  Pacific  North¬ 
west  in  January-July  2000.  The  predictive  density  is  taken  to  be  normal,  centered  at 
the  ensemble  mean  forecast,  and  with  predictive  standard  deviation  equal  to  r  times  the 
standard  deviation  of  the  forecast  ensemble. 


Score 

arg  maxr  s(r ) 
in  Eqn.  (42) 

Linear  Transformation 
in  Figure  1 

Quadratic  score  (QS) 

2.18 

40s  +  6 

Spherical  score  (SphS) 

1.84 

108s  -  22 

Logarithmic  score  (LogS) 

2.41 

s  +  13 

Continuous  ranked  probability  score  (CRPS) 

1.62 

10s +  8 

Linear  score  (LinS) 

105s  -  5 

Probability  score  (PS) 

60s  -  5 

as  a  function  of  the  inflation  factor  r.  The  index  i  refers  to  the  z-th  record  in  the  verification 
data  base,  and  xt  denotes  the  value  that  materialized.  Given  the  underdispersive  character 
of  the  ensemble  system,  we  expect  s(r)  to  be  maximized  at  some  r  >  1,  possibly  near  the 
observed  ratio  r  =  1.55  of  the  root-mean-square  error  of  the  ensemble  mean  forecast  over 
the  square  root  of  the  average  ensemble  variance. 

We  computed  the  mean  score  (42)  for  inflation  factors  r  £  (0, 5)  and  for  the  quadratic 
score  (QS),  spherical  score  (SphS),  logarithmic  score  (LogS),  continuous  ranked  probability 
score  (CRPS),  linear  score  (LinS),  and  probability  score  (PS),  as  defined  in  Section  4. 
Briefly,  if  p  denotes  the  predictive  density  and  x  stands  for  the  observed  value,  then 


QS  (p,x) 

=  ip(x)  -  JZoP(y)2  dy, 

SphS(p,  x) 

=  P(x)  / (J^oP(?/)2  dy)1/2, 

LogS(p,  x) 

=  log  p(x), 

CRPS(p,  x) 

B 

1 

* 

1 

h 

1 

i-H|  CO 

II 

LinS(p,  x) 

=  P(x), 

PS (p,  x) 

=  iT-i1  P(v)  dy. 

Figure  1  and  Table  2  summarize  the  results  of  this  experiment.  The  scores  shown  in  the 
figure  are  linearly  transformed,  and  the  transformations  are  listed  in  the  right-hand  column 
of  the  table.  In  the  case  of  the  quadratic  score,  for  instance,  we  plotted  the  sum  of  40 
times  the  value  in  (42)  and  6.  Clearly,  propriety  is  preserved  under  the  transformation. 
The  quadratic  score,  spherical  score,  logarithmic  score  and  continuous  ranked  probability 
score  were  maximized  at  values  of  r  that  were  larger  than  1,  thereby  confirming  the  un¬ 
derdispersive  character  of  the  ensemble.  These  scores  are  proper.  The  linear  score  and  the 
probability  score  were  maximized  at  r  =  0.05  and  r  =  0.02,  respectively,  thereby  suggesting 
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ignorable  forecast  uncertainty  and  essentially  deterministic  forecasts.  The  latter  two  scores 
have  intuitive  appeal,  and  the  probability  score  has  been  used  to  assess  forecast  ensembles 
(Wilson,  Burrows  and  Lanzinger  1999).  However,  they  are  improper  and  their  use  may 
result  in  misguided  scientific  inferences,  as  in  this  experiment.  A  similar  comment  applies 
to  the  scores  discussed  in  Section  4.4. 

It  is  interesting  to  observe  that  the  logarithmic  score  gave  the  highest  maximizing  value 
of  r.  The  logarithmic  score  is  strictly  proper  but  involves  a  harsh  penalty  for  low  probability 
events  and  therefore  is  highly  sensitive  to  extreme  cases.  Our  verification  data  base  includes 
a  number  of  low  spread  cases  for  which  the  ensemble  variance  implodes.  The  logarithmic 
score  penalizes  the  resulting  predictions,  unless  the  inflation  factor  r  is  large.  Weigend  and 
Shi  (2000,  p.  382)  noted  similar  concerns  and  considered  the  use  of  trimmed  means  when 
computing  the  logarithmic  score.  In  our  experience,  the  continuous  ranked  probability  score 
is  less  sensitive  to  extreme  cases  or  outliers  and  provides  an  attractive  alternative. 

8.3  Evaluation  of  interval  forecasts 

The  aforementioned  predictive  densities  also  provide  interval  forecasts.  We  considered  the 
central  (1  —  a)  x  100%  prediction  interval  where  a  =  0.50  and  a  =  0.10,  respectively.  The 
associated  lower  and  upper  prediction  bounds  lj  and  Uj  are  the  %  and  1  —  f  quantiles  of 
the  normal  distribution  with  mean  //,;  and  standard  deviation  raj,  as  described  above.  We 
assessed  the  resulting  interval  forecasts  in  their  dependence  on  the  inflation  factor  r  in  two 
ways,  by  computing  the  empirical  coverage  of  the  prediction  intervals,  and  by  computing 

1  16  015 

sa(r)  =  Sa(li,uf,Xi),  r  >  0,  (43) 

i=  1 

where  Sa  denotes  the  interval  score  (29).  This  scoring  rule  assesses  both  calibration  and 
sharpness  —  the  latter  by  rewarding  narrow  prediction  intervals,  and  the  former  by  penal¬ 
izing  prediction  intervals  that  do  not  cover  the  observation.  Figure  2(a)  shows  the  empirical 
coverage  of  the  prediction  intervals.  Clearly,  the  coverage  increased  with  r.  If  a  =  0.50 
and  a  =  0.10  the  nominal  coverage  was  obtained  at  r  =  1.78  and  r  =  2.11,  respectively. 
This  confirms  the  underdispersive  character  of  the  ensemble.  Figure  2(b)  shows  the  interval 
score  (43)  as  a  function  of  the  inflation  factor  r.  If  a  =  0.50  and  a  =  0.10  the  score  was 
maximized  at  r  =  1.56  and  r  =  1.72,  respectively. 

9  Optimum  score  estimation 

Strictly  proper  scoring  rules  are  also  of  interest  in  estimation  problems,  where  they  provide 
attractive  loss  and  utility  functions  that  can  be  adapted  to  the  problem  at  hand. 

9.1  Point  estimation 

We  return  to  the  generic  estimation  problem  described  in  the  introduction.  Suppose  that  we 
wish  to  fit  a  parametric  model  Pg  based  on  a  sample  X\, . . .  ,Xn  of  identically  distributed 
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Figure  2:  Interval  forecasts  of  sea- level  pressure  over  the  North  American  Pacific  Northwest 
in  January-July  2000:  (a)  Nominal  and  actual  coverage,  and  (b)  the  interval  score  (43),  for 
the  50%  central  prediction  interval  ( a  =  0.50,  broken  line)  and  the  90%  central  prediction 
interval  (cc  =  0.10,  solid  line,  score  scaled  by  a  factor  of  10).  The  predictive  density  is 
Gaussian,  centered  at  the  ensemble  mean  forecast,  and  with  predictive  standard  deviation 
equal  to  r  times  the  standard  deviation  of  the  forecast  ensemble. 
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observations.  To  estimate  9,  we  can  measure  the  goodness-of-fit  by  the  mean  score 


1  n 

Sn(9)  =  -'£S(Pe,Xi), 

where  S'  is  a  scoring  rule  that  is  (strictly)  proper  relative  to  a  convex  class  of  probability 
measures  that  contains  the  parametric  model.  If  6 o  denotes  the  true  parameter  value, 
asymptotic  arguments  indicate  that 

argmax0<Sn($)  — ►  Qq  as  n  — ►  oo.  (44) 

This  suggests  a  general  approach  to  estimation:  Choose  a  strictly  proper  scoring  rule  S  that 
is  tailored  to  the  scientific  problem  at  hand  and  take  9n  =  argmax0<Sn(0)  as  the  optimum 
score  estimator  based  on  the  scoring  rule  S.  The  first  four  values  of  the  arg  max  in  Table 
2,  for  instance,  refer  to  the  optimum  score  estimate  for  the  inflation  factor  r  based  on 
the  logarithmic  score,  spherical  score,  quadratic  score  and  continuous  ranked  probability 
score,  respectively.  Pfanzagl  (1969)  and  Birge  and  Massart  (1993)  studied  optimum  score 
estimators  under  the  heading  of  minimum  contrast  estimators.  This  class  includes  many  of 
the  most  popular  estimators  in  various  situations  such  as  maximum  likelihood  estimators, 
least  squares  and  other  estimators  of  regression  models,  and  estimators  for  mixture  models 
or  deconvolution.  Pfanzagl  (1969)  proved  rigorous  versions  of  the  consistency  result  (44), 
and  Birge  and  Massart  (1993)  related  rates  of  convergence  to  the  entropy  structure  of 
the  parameter  space.  Maximum  likelihood  estimation  forms  the  special  case  of  optimum 
score  estimation  based  on  the  logarithmic  score,  and  optimum  score  estimation  forms  a 
special  case  of  M-estimation  (Huber  1964),  in  that  the  function  to  be  optimized  derives 
from  a  strictly  proper  scoring  rule.  When  estimating  the  location  parameter  in  a  normal 
population  with  known  variance,  for  example,  the  optimum  score  estimator  based  on  the 
continuous  ranked  probability  score  amounts  to  an  M-estimator  with  a  ^-function  of  the 
form  tjj(x)  =  2<h(^)  —  1,  where  c  is  a  positive  constant  and  denotes  the  standard  normal 
cumulative.  This  provides  a  smooth  version  of  the  ^-function  for  Huber’s  (1964)  robust 
minimax  estimator;  see  Huber  (1981,  p.  208).  Asymptotic  results  for  M-estimators,  such  as 
the  consistency  theorems  of  Huber  (1967)  and  Perlman  (1972),  then  apply  to  optimum  scores 
estimators,  too.  Wald’s  (1949)  classical  proof  of  the  consistency  of  maximum  likelihood 
estimates  relies  heavily  on  the  strict  propriety  of  the  logarithmic  score,  which  is  proved  in 
his  Lemma  1. 

The  appeal  of  optimum  score  estimation  lies  in  the  potential  adaption  of  the  scoring  rule 
to  the  problem  at  hand.  This  approach  has,  apparently,  only  very  recently  been  explored. 
Gneiting,  Raftery,  Westveld  and  Goldman  (2005)  estimated  a  predictive  regression  model 
using  the  optimum  score  estimator  based  on  the  continuous  ranked  probability  score  —  a 
choice  that  was  motivated  by  the  meteorological  problem  at  hand.  They  showed  empirically 
that  such  an  approach  can  yield  better  predictive  results  than  approaches  using  maximum 
likelihood  plug-in  estimates.  This  agrees  with  the  findings  of  Copas  (1983)  and  Friedman 
(1989)  who  showed  that  the  use  of  maximum  likelihood  and  least  squares  plug-in  estimates 
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can  be  suboptimal  in  prediction  problems.  Buja  et  al.  (2005)  proposed  the  use  of  strictly 
proper  scoring  rules  in  classification  and  class  probability  estimation  problems  and  drew 
links  to  Bayesian  techniques  as  well  as  boosting. 

9.2  Quantile  estimation 

Koenker  and  Bassett  (1978)  proposed  quantile  regression  using  an  optimum  score  estimator 
that  is  based  on  the  proper  scoring  rule  (29). 

9.3  Interval  estimation 

We  now  turn  to  interval  estimation.  Casella,  Hwang  and  Robert  (1993,  p.  141)  pointed  out 
that 


“The  question  of  measuring  optimality  (either  frequentist  or  Bayesian)  of  a  set 
estimator  against  a  loss  criterion  combining  size  and  coverage  does  not  yet  have 
a  satisfactory  answer.” 

Their  work  was  motivated  by  an  apparent  paradox  due  to  J.  O.  Berger,  which  concerns 
interval  estimators  of  the  location  parameter  9  in  a  normal  population  with  unknown  scale. 
Let  1{-}  denote  an  indicator  function.  Under  the  loss  function 

L(I;  9)  =  c\(I)  —  1{9  6  /},  (45) 

where  c  is  a  positive  constant  and  A  (I)  denotes  the  Lebesgue  measure  of  the  interval  estimate 
/,  the  classical  f-interval  is  dominated  by  a  misguided  interval  estimate  that  shrinks  to  the 
sample  mean  in  the  cases  of  the  highest  uncertainty.  Casella  et  al.  (1993,  p.  145)  commented 
that  “we  have  a  case  where  a  disconcerting  rule  dominates  a  time  honored  procedure.  The 
only  reasonable  conclusion  is  that  there  is  a  problem  with  the  loss  function.”  We  concur, 
and  we  propose  the  use  of  strictly  proper  scoring  rules  to  assess  interval  estimators  using  a 
loss  criterion  that  combines  width  and  coverage. 

Specifically,  we  contend  that  a  meaningful  comparison  of  interval  estimators  requires 
either  equal  coverage  or  equal  width.  The  loss  function  (45)  applies  to  all  set  estimates, 
regardless  of  coverage  and  size,  which  seems  unnecessarily  ambitious.  Instead,  we  focus 
attention  on  interval  estimators  with  equal  nominal  coverage  and  use  the  (negative  of  the) 
interval  score  (29).  This  loss  function  can  be  written  as 

La(I;  9)  =  \(I)  +  —  inf  \9  —  r]\,  (46) 

a  )?e/ 

and  applies  to  interval  estimates  with  upper  and  lower  exceedance  probability  j  x  100%, 
respectively.  This  approach  can,  again,  be  traced  back  to  Dunsmore  (1968)  and  Winkler 
(1972)  and  avoids  paradoxes,  as  a  consequence  of  the  propriety  of  the  interval  score.  When 
compared  to  (45),  the  loss  function  (46)  provides  a  more  flexible  assessment  of  the  coverage, 
by  taking  account  of  the  distance  between  the  interval  estimate  and  the  estimand. 
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Appendix 


Statistical  depth  functions  (Zuo  and  Serfling  2000)  provide  useful  tools  in  nonparametric 
inference  for  multivariate  data.  Specifically,  if  P  is  a  Borel  probability  measure  on  Rm,  a 
depth  function  D(P ,  x)  gives  a  P-based  center-outward  ordering  of  points  x  G  Rm.  Formally, 
this  resembles  a  scoring  rule  S(P,x)  that  assigns  a  P-based  numerical  value  to  an  event 
x  G  Rm.  Liu  (1990)  and  Zuo  and  Serfling  (1999)  list  several  desirable  properties  of  depth 
functions,  including  maximality  at  the  center,  nronotonicity  relative  to  the  deepest  point, 
affine  invariance,  and  vanishing  at  infinity.  The  latter  two  properties  do  not  appear  to 
be  defendable  requirements  for  scoring  rules;  conversely,  propriety  is  irrelevant  for  depth 
functions. 
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